In this notebook, we use deep learning architectures for regression and classification. We begin by installing/importing the necessary packages. Then, we use a deep learning model for a regression task, and finally we use a deep learning model for a classification task.
# Install packages:
# !pip install pandas
# !pip install tensorflow
# !pip install keras
# !pip install seaborn
# !pip install category_encoders
# Import packages:
import sys
import pandas as pd
pd.set_option('display.max_rows', 50)
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
import matplotlib.pyplot as plt
import seaborn as sns
from category_encoders import BaseNEncoder # The only encoder we use in this notebook
# sklearn imports:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# keras imports:
from keras.models import Sequential
from keras.layers import Dense
# from keras.wrappers.scikit_learn import KerasRegressor # Unused here; this wrapper was removed in recent Keras versions
from keras.backend import clear_session
Each row in the dataset corresponds to a property that was inspected and given a hazard score ("Hazard"). You can think of the hazard score as a continuous number that represents the condition of the property as determined by the inspection. Some inspection hazards are major and contribute more to the total score, while some are minor and contribute less. The total score for a property is the sum of the individual hazards.
The aim of the competition is to forecast the hazard score based on anonymized variables which are available before an inspection is ordered.
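As a toy illustration of how such a score comes together (the records and point values below are made up; the real competition variables are anonymized), summing per-property hazards in pandas might look like:

```python
import pandas as pd

# Hypothetical inspection records: two properties, with minor and major hazards.
inspections = pd.DataFrame({
    "property_id": [1, 1, 1, 2, 2],
    "hazard_points": [1, 2, 7, 1, 1],  # minor hazards contribute less, major ones more
})

# The total Hazard score per property is the sum of its individual hazards.
totals = inspections.groupby("property_id")["hazard_points"].sum()
print(totals.to_dict())  # property 1 totals 10, property 2 totals 2
```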
Source: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction/data
df = pd.read_csv("data/insurance.csv")
df.head()
df.info()
df.columns
sns.pairplot(df[['Hazard','T1_V1', 'T1_V2' ]], diag_kind='kde')
cat_cols = list(df.select_dtypes(include=['O']).columns) # find columns of data type object (categorical variables)
encoder = BaseNEncoder(cols=cat_cols).fit(df)
df = encoder.transform(df)
df.head()
df.columns
X = df.drop(columns = ['Id', 'Hazard']) # Drop Id and Hazard from the list of predictors
Y = df.Hazard # This is what we want to predict
It is good practice to rescale features that use different scales and ranges.
One reason this is important is because the features are multiplied by the model weights. So the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.
Although a model might converge without scaling, scaling the input data makes training much more stable.
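As a quick sketch of what min-max scaling does (using a made-up feature vector), each value is mapped into [0, 1] via x' = (x - min) / (max - min):

```python
import numpy as np

# A made-up 1-D feature with min 2 and max 10.
x = np.array([2.0, 5.0, 8.0, 10.0])

# Min-max scaling: subtract the minimum, divide by the range.
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # 0, 0.375, 0.75, 1
```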
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)
After scaling our data, we can go ahead and split it to train and test subsets:
trainData, testData, trainLabels, testLabels = train_test_split(X, Y, test_size=.2, random_state = 1)
Now, we train the neural network. We feed in the input variables ('T1_V1', 'T1_V2', 'T1_V3', ...), pass them through two hidden layers of 71 and 50 neurons respectively, and use a single output neuron with a linear activation function to produce the predicted hazard score.
# Clear the previous model:
clear_session()
model = Sequential()
model.add(Dense(71, # Number of nodes in the first layer
input_dim=71, # Number of inputs (predictors)
activation='relu' # Activation function for this layer
))
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='linear'))
model.summary()
After designing the network architecture, we can go ahead and compile it:
model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
And finally, we can go ahead and fit the model using the train data:
history = model.fit(trainData, trainLabels, # Data to be used for fitting/ training the model
epochs=150, # Number of times that the learning algorithm will work through the training data
batch_size=50, # Number of samples to be used in each iteration
verbose=1, # Whether to print the progress
validation_split=0.2 # The portion of samples to be used for validation (different from our test data)
)
We can now visualize the loss (MSE) for training and validation data over the epochs:
print(history.history.keys())
# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
And finally, we can use the model to make predictions over our test data:
test_predictions = model.predict(testData).flatten()
a = plt.axes(aspect='equal')
plt.scatter(testLabels, test_predictions)
plt.xlabel('True Values [Hazard]')
plt.ylabel('Predictions [Hazard]')
lims = [Y.min(), Y.max()]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)
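Besides the scatter plot, it can help to quantify the error numerically. A minimal sketch with made-up stand-ins for `testLabels` and `test_predictions`, computing MAE and RMSE by hand with NumPy:

```python
import numpy as np

# Illustrative true values and predictions (stand-ins for testLabels and test_predictions).
y_true = np.array([1.0, 4.0, 2.0, 8.0])
y_pred = np.array([2.0, 3.0, 2.0, 6.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
print(mae, rmse)
```

The same quantities for the real model can be computed by substituting `testLabels` and `test_predictions` for the arrays above.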
In the example below, we use the U.S. Adult salary dataset to predict whether a household’s income is above or below $50K. The data comes from the U.S. Census and includes the following columns:
age: Age of the head of household.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous. Not useful for the analysis and should be excluded.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous. Contains the same information as education, but represented by numbers instead of characters.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. Not useful for the analysis and should be excluded.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
salary: Whether the household makes over $50K or not.
Missing values are noted in the data using ?. For more information about this dataset, please refer to: https://archive.ics.uci.edu/ml/datasets/adult
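A small sketch of how the `na_values` argument handles the `?` symbol, using a made-up two-column stand-in for the real file:

```python
import io
import pandas as pd

# A tiny stand-in for adult.csv, where '?' marks a missing value.
csv = io.StringIO("age,workclass\n39,Private\n50,?\n38,Self-emp-inc\n")
sample = pd.read_csv(csv, na_values='?')

# The '?' entry is read as NaN, so it shows up in the missing-value counts.
print(sample.isna().sum().to_dict())
```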
We go ahead and import the data. As we read it into Pandas, we make sure to indicate the symbol used for missing values. We also rename the columns to make them more consistent:
data = pd.read_csv('data/adult.csv', na_values = '?')
data.columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
'occupation','relationship','race','sex','capital-gain','capital-loss',
'hours-per-week','native-country','salary']
data.head(2)
Some of these columns are either not useful or redundant. Therefore, we remove them from our data:
data.drop(columns = ['fnlwgt', # Not useful
'education', # Redundant
'relationship' # Not useful
], inplace = True # Overwrite data
)
data.head(2)
We can take a look at some descriptive information from the data:
data.describe(include = 'all')
We can also check what data types we have in this data:
data.info()
We see that there are several rows with missing values. We go ahead and drop all rows with any missing values from the data:
data.dropna(inplace = True)
data.info()
There are a few things we need to take care of before we build our model:
To prepare the categorical variables, we use a powerful package called category-encoders:
encoder = BaseNEncoder(cols=['workclass', # Categorical variable to be encoded
'marital-status', # Categorical variable to be encoded
'occupation', # Categorical variable to be encoded
'race', # Categorical variable to be encoded
'sex', # Categorical variable to be encoded
'native-country' # Categorical variable to be encoded
],
base = 3 # Increasing this value will create fewer variables
).fit(data)
df = encoder.transform(data) # We call the transformed data df. From now on, we refer to the data as df
df.head(2)
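To see why a larger `base` creates fewer variables: base-N encoding represents each category as digits in base N, so a column with k distinct levels needs roughly ceil(log_N(k)) encoded columns (a rough sketch; the exact count can differ by one depending on the implementation). For example, for the 41 levels of native-country:

```python
import math

def basen_columns(n_categories, base):
    # Approximate number of encoded columns base-N encoding needs for a
    # categorical column with n_categories distinct levels.
    return max(1, math.ceil(math.log(n_categories, base)))

# 41 native-country levels under different bases: fewer columns as base grows.
for base in (2, 3, 5):
    print(base, basen_columns(41, base))
```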
The next thing we need to do before we build our model is to separate our predictors from our target variable:
X = df.drop(columns = ['salary'])
feature_list = X.columns
y = np.where(df.salary=="<=50K", 0, 1) # Convert the target values to 0s (less than 50K) and 1s (more than 50K)
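`np.where` acts here as a vectorized if/else: wherever the condition holds it takes the first value, otherwise the second. A short sketch on a made-up salary array:

```python
import numpy as np

# Made-up salary labels, matching the two values in the dataset.
salary = np.array(["<=50K", ">50K", "<=50K", ">50K"])

# 0 where the condition is True (income <= 50K), 1 otherwise.
y_demo = np.where(salary == "<=50K", 0, 1)
print(y_demo)  # [0 1 0 1]
```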
And finally, we scale the predictors to make them comparable:
scaler = MinMaxScaler() # Define the scaling method
scaler.fit(X) # Fit the scaling method
X = scaler.transform(X) # Apply the scaling method to X and call it X again
Split the data into Train and Test:
To build a generalizable predictive model, we need an unbiased way to evaluate it. Machine learning models can be quite complex, and it can be hard to understand how they actually learn from the data. One thing we can be sure about is that if we evaluate the model on data it has not seen, we can examine to what extent it generalizes. For instance, if we use one portion of the original data to build/train the model and another portion to test/evaluate it, we can check how well the model performs on cases/samples it has never seen. Splitting the original data into train and test samples is called the hold-out method: we essentially hold a portion of the original data out so we can use it to test and evaluate the model. To split the data into train and test samples, we can use scikit-learn:
trainData, testData, trainLabels, testLabels = train_test_split(X,
y,
train_size = .8, # Proportion of train samples
random_state = 1 # We set this to create reproducible results
)
Now, we can go ahead and design our network:
# Clear the previous model:
clear_session()
model = Sequential()
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
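The sigmoid on the output layer squashes the network's single output into (0, 1), so it can be read as the probability of the positive class. A quick sketch of the function itself:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 -- the decision boundary we use later (> 0.5 means class 1)
print(sigmoid(3.0))   # close to 1
print(sigmoid(-3.0))  # close to 0
```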
Let's go ahead and compile and fit the model:
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(trainData, trainLabels,
epochs=50,
batch_size=50,
verbose=1,
validation_split=0.2)
We can obtain the predictions using the model:
model.summary()
predictionProbabilities = model.predict(testData).flatten() # Predicted probabilities of the positive class
predictions = (predictionProbabilities > 0.5).astype("int32") # Threshold the probabilities to get class labels
# Import functions used to calculate metrics such as accuracy, precision, and recall:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
def calculateMetricsAndPrint(predictions, predictionsProbabilities, actualLabels):
    accuracy = accuracy_score(actualLabels, predictions) * 100
    precisionNegative = precision_score(actualLabels, predictions, average = None)[0] * 100
    precisionPositive = precision_score(actualLabels, predictions, average = None)[1] * 100
    recallNegative = recall_score(actualLabels, predictions, average = None)[0] * 100
    recallPositive = recall_score(actualLabels, predictions, average = None)[1] * 100
    auc = roc_auc_score(actualLabels, predictionsProbabilities) * 100
    print("Accuracy: %.2f\nPrecisionNegative: %.2f\nPrecisionPositive: %.2f\nRecallNegative: %.2f\nRecallPositive: %.2f\nAUC Score: %.2f\n" %
          (accuracy, precisionNegative, precisionPositive, recallNegative, recallPositive, auc))
calculateMetricsAndPrint(predictions, predictionProbabilities, testLabels)
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
pos_probs = predictionProbabilities
fpr, tpr, thresholds = roc_curve(testLabels, pos_probs, pos_label = 1)
# calculate scores
lr_auc = roc_auc_score(testLabels, pos_probs)
print('AUC Score = %.3f' % (lr_auc * 100))
plt.rcParams['figure.figsize'] = [7, 7]
plot_roc_curve(fpr, tpr)